Data Presentation and Formulation of the Problem¶

🔍 The Problem:¶

  • The company needs to cut costs in response to the drop in pulp prices on the international market.
  • One front in this effort is reducing asset maintenance costs by making maintenance more precise and objective, doing only what is necessary without, of course, harming production.

🎲 The Data:¶

  • To that end, the maintenance team provided us with a historical series for each asset that failed at some point, totaling 100 assets, plus a second table with another 100 assets for testing the model's accuracy.

🎯 The main goal is to evaluate the feasibility of a model capable of predicting each asset's need for maintenance 20 cycles in advance.¶

🚀 The Results:¶

  • We were able to build two models that predict equipment failure up to 20 cycles in advance.
  • The two models rest on different principles and can therefore produce different results depending on the objective.
  • There is not enough data and information to objectively calculate the cost savings or to decide which model is best.

📊 Exploratory Data Analysis¶

Understanding and visualizing the data

🤓 So we have this table:

  • The identification of each asset
  • A runtime counter
  • The settings used
  • And the measurements from 21 sensors on each of these assets
Asset id runtime setting_1 setting_2 setting_3 Tag1 Tag2 Tag3 Tag4 Tag5 ... Tag12 Tag13 Tag14 Tag15 Tag16 Tag17 Tag18 Tag19 Tag20 Tag21
0 1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 ... 521.66 2388.02 8138.62 8.4195 0.03 392 2388 100.0 39.06 23.4190
1 1 2 0.0019 -0.0003 100.0 518.67 642.15 1591.82 1403.14 14.62 ... 522.28 2388.07 8131.49 8.4318 0.03 392 2388 100.0 39.00 23.4236
2 1 3 -0.0043 0.0003 100.0 518.67 642.35 1587.99 1404.20 14.62 ... 522.42 2388.03 8133.23 8.4178 0.03 390 2388 100.0 38.95 23.3442
3 1 4 0.0007 0.0000 100.0 518.67 642.35 1582.79 1401.87 14.62 ... 522.86 2388.08 8133.83 8.3682 0.03 392 2388 100.0 38.88 23.3739
4 1 5 -0.0019 -0.0002 100.0 518.67 642.37 1582.85 1406.22 14.62 ... 522.19 2388.04 8133.80 8.4294 0.03 393 2388 100.0 38.90 23.4044

5 rows × 26 columns

🆔 ⏳ Assets and Runtimes¶

We have 20631 observations, each corresponding to one run cycle of an asset.

Let's visualize the number of cycles performed by each asset and look at some general statistics.
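The per-asset statistics below can be produced with a pandas `groupby`; here is a minimal sketch on a toy frame (the `asset_id`/`runtime` column names mirror the table above, but the data is made up):

```python
import pandas as pd

# Toy stand-in for the real table: each row is one run cycle of an asset.
df = pd.DataFrame({
    "asset_id": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "runtime":  [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4],
})

# The last runtime recorded per asset is its total number of cycles before failure.
cycles_per_asset = df.groupby("asset_id")["runtime"].max()
print(cycles_per_asset.describe())
```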

runtime
count 100.000000
mean 206.310000
std 46.342749
min 128.000000
25% 177.000000
50% 199.000000
75% 229.250000
max 362.000000

The statistics show that assets last about 206 runtimes on average (median 199). We can also see that the asset that failed earliest ran only 128 cycles, while the longest-lived ran 362.

⚙️ Settings¶

The second layer of the dataset is the settings used to run each asset. Let's see:

setting_1 setting_2 setting_3
count 20631.000000 20631.000000 20631.0
mean -0.000009 0.000002 100.0
std 0.002187 0.000293 0.0
min -0.008700 -0.000600 100.0
25% -0.001500 -0.000200 100.0
50% 0.000000 0.000000 100.0
75% 0.001500 0.000300 100.0
max 0.008700 0.000600 100.0

As the standard deviations show, there are no considerable changes in the settings. This suggests (and only suggests, since we don't have more details on what exactly these settings are) that the assets operate in a stable production line with little variation.

🏷️ Tags¶

Last but not least, we have the 21 tags representing the readings of each of the asset's monitoring sensors. Let's take a look:

count mean std min 25% 50% 75% max
Tag1 20631.0 518.670000 0.000000e+00 518.6700 518.6700 518.6700 518.6700 518.6700
Tag2 20631.0 642.680934 5.000533e-01 641.2100 642.3250 642.6400 643.0000 644.5300
Tag3 20631.0 1590.523119 6.131150e+00 1571.0400 1586.2600 1590.1000 1594.3800 1616.9100
Tag4 20631.0 1408.933782 9.000605e+00 1382.2500 1402.3600 1408.0400 1414.5550 1441.4900
Tag5 20631.0 14.620000 1.776400e-15 14.6200 14.6200 14.6200 14.6200 14.6200
Tag6 20631.0 21.609803 1.388985e-03 21.6000 21.6100 21.6100 21.6100 21.6100
Tag7 20631.0 553.367711 8.850923e-01 549.8500 552.8100 553.4400 554.0100 556.0600
Tag8 20631.0 2388.096652 7.098548e-02 2387.9000 2388.0500 2388.0900 2388.1400 2388.5600
Tag9 20631.0 9065.242941 2.208288e+01 9021.7300 9053.1000 9060.6600 9069.4200 9244.5900
Tag10 20631.0 1.300000 0.000000e+00 1.3000 1.3000 1.3000 1.3000 1.3000
Tag11 20631.0 47.541168 2.670874e-01 46.8500 47.3500 47.5100 47.7000 48.5300
Tag12 20631.0 521.413470 7.375534e-01 518.6900 520.9600 521.4800 521.9500 523.3800
Tag13 20631.0 2388.096152 7.191892e-02 2387.8800 2388.0400 2388.0900 2388.1400 2388.5600
Tag14 20631.0 8143.752722 1.907618e+01 8099.9400 8133.2450 8140.5400 8148.3100 8293.7200
Tag15 20631.0 8.442146 3.750504e-02 8.3249 8.4149 8.4389 8.4656 8.5848
Tag16 20631.0 0.030000 1.387812e-17 0.0300 0.0300 0.0300 0.0300 0.0300
Tag17 20631.0 393.210654 1.548763e+00 388.0000 392.0000 393.0000 394.0000 400.0000
Tag18 20631.0 2388.000000 0.000000e+00 2388.0000 2388.0000 2388.0000 2388.0000 2388.0000
Tag19 20631.0 100.000000 0.000000e+00 100.0000 100.0000 100.0000 100.0000 100.0000
Tag20 20631.0 38.816271 1.807464e-01 38.1400 38.7000 38.8300 38.9500 39.4300
Tag21 20631.0 23.289705 1.082509e-01 22.8942 23.2218 23.2979 23.3668 23.6184

It's a little messy to analyze all those numbers, so let's build some visualizations.

👀 So, what can we intuit from the combination of the statistical summary and the visualization of the sensors for each asset?

  1. Some sensors (Tag1, Tag10, Tag18 and Tag19, with Tag5 and Tag16 effectively constant as well) show no variation across the data, as the repeated single value and zero standard deviation indicate. We can drop them to keep only useful information and reduce the risk of model overfitting.
  2. In the visualization of the first asset's sensors, we can see a trend in the variation, usually in the final quarter of the data, indicating an approaching failure. This is a good sign that there is a pattern the model can rely on to make the prediction.
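One way to drop the flat sensors is to remove every column whose standard deviation is zero; a sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical frame: Tag1 is constant (zero std), Tag2 varies.
df = pd.DataFrame({
    "Tag1": [518.67, 518.67, 518.67, 518.67],
    "Tag2": [641.8, 642.1, 642.4, 642.3],
})

# Columns with zero standard deviation carry no signal for the model.
constant_cols = df.columns[df.std() == 0]
df = df.drop(columns=constant_cols)
print(list(constant_cols))  # columns removed
```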

💀 Remaining Useful Life (RUL)¶

To advance the analysis and enable prediction, we can use the concept of remaining useful life (RUL). It works as a countdown indicating how many cycles each asset has left before failure. Calculating it is simple: since we already know the total useful life of each asset, we just invert the count, so RUL = (final cycle) - (current cycle).
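In pandas, that inverted count is one `groupby` plus `transform` away; a sketch on toy data:

```python
import pandas as pd

# Toy data: two assets, one row per cycle.
df = pd.DataFrame({
    "asset_id": [1, 1, 1, 2, 2],
    "runtime":  [1, 2, 3, 1, 2],
})

# RUL = (asset's final cycle) - (current cycle): a countdown to failure.
df["RUL"] = df.groupby("asset_id")["runtime"].transform("max") - df["runtime"]
print(df["RUL"].tolist())  # [2, 1, 0, 1, 0]
```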

And we have the following distribution:

Or, we can also work with the RUL to better visualize the correlation between the sensor data and the failure.

We can also observe which sensors do not add relevant information to the construction of a linear model.

Conclusions 🤔¶

✅ The asset lifetimes follow an approximately normal distribution, so we can make statistical inferences about equipment lifetime.

✅ Settings have little or no impact on building a predictive model.

✅ We can use RUL calculation to correlate variables and predict future failure.

✅ Some sensors provide no relevant information for building a linear model, so we will discard them at first.

🧠 Modeling¶

There are two possible approaches among predictive models: regression and classification.

A classification model signals whether the predicted value falls within a specific label. Let's start with it!

🔢 Logistic regression baseline¶

In our case, the first need is to predict whether the asset is within the final 20 cycles before failure: a kind of health indicator for the equipment.

Here, measurements within the final 20 cycles are labeled 0; anything before that threshold is labeled 1.
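A sketch of that labeling rule (assuming the boundary is RUL ≤ 20; the exact cut-off used in the notebook is an assumption here):

```python
import pandas as pd

# Hypothetical RUL values for a handful of cycles.
rul = pd.Series([30, 25, 20, 10, 0])

# 0 = inside the final 20 cycles (unhealthy window), 1 = still healthy.
label = (rul > 20).astype(int)
print(label.tolist())  # [1, 1, 0, 0, 0]
```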

🤖 Doing some machine learning magic 🤖
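The "magic" is roughly a train/test split plus a `LogisticRegression` fit; this sketch uses synthetic, imbalanced data rather than the real sensor matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic stand-in for the sensor features: only the first one is informative.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > -0.8).astype(int)  # imbalanced, like the 0/1 health labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```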


              precision    recall  f1-score   support

           0       0.84      0.83      0.83       375
           1       0.98      0.98      0.98      3752

    accuracy                           0.97      4127
   macro avg       0.91      0.91      0.91      4127
weighted avg       0.97      0.97      0.97      4127

Excellent job! With a simple logistic regression we reach 97% overall accuracy, flagging the failure window (class 0) with 84% precision and 83% recall, 20 cycles in advance! 😯

📈 Regression baseline¶

What if we wanted not only to know if the asset is healthy or unhealthy, but to know more precisely how many cycles it is at the end of its useful life?

This is where regression comes in, not just working with labels but with numbers more specifically.

Let's run it! 🏃
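A baseline like this boils down to fitting `LinearRegression` and reporting RMSE and R²; sketched here on synthetic data instead of the real sensors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic noisy linear target standing in for the RUL.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)
rmse = mean_squared_error(y, pred) ** 0.5  # root of the mean squared error
r2 = r2_score(y, pred)
print(f"RMSE: {rmse:.3f}, R2: {r2:.3f}")
```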

RMSE: 44.334, R²: 0.5698

R² can be understood, loosely, as how much of the target variable's variance the model explains. RMSE is the margin of error, measured in RUL runtimes.

So a simple linear regression model explains around 56.98% of the variance, with an error margin of up to 44 cycles. 🥶

It doesn't seem very accurate, let's see if we can get better results.

Here we can bring in a more accurate understanding of the RUL from the exploratory analysis, where we observed the degradation in the sensors.

📉 Degradation is not linear over the asset's whole life! It only sets in at some point in time.

Since we have no further information about what the sensors actually measure, I will assume that the regression only applies from the 120th cycle onward.
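One common way to encode "degradation only matters late in life" is to cap the RUL target at a ceiling before fitting; the ceiling value of 125 below is hypothetical, and the notebook may instead filter the early cycles directly:

```python
import pandas as pd

# Hypothetical RUL series; early-life values are capped at a ceiling so the
# model only has to learn the degradation phase.
rul = pd.Series([200, 150, 125, 60, 0])
rul_capped = rul.clip(upper=125)
print(rul_capped.tolist())  # [125, 125, 125, 60, 0]
```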

Annnd ta-daaa!

RMSE: 18.764, R²: 0.7730

The model improves considerably, now explaining 77% of the RUL's variance with an error margin of 18 cycles! 🥳

🤯 XGBoost¶

XGBoost is a much more refined and complex model, with great performance on problems like this one. Let's try it.

RMSE: 18.336, R²: 0.7832

We get an improvement in R² (accompanied by a reduction in RMSE). But the great advantage of XGBoost is its hyperparameters; let's run a random search for the best values of these parameters:

Here we go!

RMSE: 17.388, R²: 0.8051

And again we can see an improvement in efficiency! 🤩

This is our final model: 80% of the variance explained, with an error margin of 17 cycles. 👌

From this model I will generate predictions for the test dataset and save them to a .csv file for comparison against the true values.
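Saving the predictions is a simple `to_csv` call; the asset ids, column names, and values below are placeholders, not the notebook's real output:

```python
import pandas as pd

# Hypothetical predictions for a few test assets (values made up here).
preds = pd.DataFrame({
    "asset_id": [1, 2, 3],
    "predicted_RUL": [112.4, 87.1, 40.9],
})
preds.to_csv("predictions.csv", index=False)
print(pd.read_csv("predictions.csv").shape)  # (3, 2)
```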

💡 Conclusion¶

  • The data allows us to make predictions with a good margin of accuracy.
  • The maintenance team can plan its routines more precisely using either model (classification or regression).
  • There are always trade-offs: the payoff of using each model depends on maintenance costs.
  • The classification model is, today, more accurate, but it cannot predict exactly when the failure will occur.
  • The regression model is more suitable in this case, but with an error margin of 17 cycles it would not yet bring large gains; its advantage is that it can be improved over time.
